The problem explored in this project is the identification of a given user based on their typing patterns. Here, we use 4 datasets, found in the data folder under the names 'keystroke0' (the CMU dataset), 'keystroke1' (the standard dataset), 'keystroke2' (the multi-password dataset), and 'keystroke3' (the demographics dataset). The latter 3 come from the GREYC lab.
In this project, we will run an SVM classifier in several configurations. First, we will use an n-class classifier, where n is the total number of users in the dataset, and try to determine precisely which of the n users each test sample came from.
Next, we will run a two-class classifier to decide, with a simple binary answer, whether the input came from the genuine user or from one of the impostors. This is a simpler problem because it does not need to be as specific, but it is more practical for real-world applications: keyboard biometrics are chiefly useful for verifying that the person typing the password is in fact the claimed user.
Finally, we want to measure not only how often the classifier answers correctly for the genuine user and the impostors, but also to focus on the false positives (when an impostor gets in) and the false negatives (when the genuine user is erroneously blocked from the system). For this portion, we will construct ROC curves to visually present the true and false positive rates of the detections.
Support vector machines (SVMs) are supervised learning models. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm encodes every input as a p-dimensional vector and computes a (p-1)-dimensional hyperplane that separates these points into two groups. A new data point is then encoded as a similar p-dimensional vector, falls on one side of the hyperplane or the other, and is classified accordingly.
The SVM creates a margin separating the data, defined by the distance from the hyperplane to the nearest point on either side. If the data is linearly separable, it is possible to create a "hard margin", meaning every training point is classified correctly and no data point falls inside the margin at all. This makes a very good classifier, but the data is not always this nice. More commonly, a "soft margin" is used, which places a penalty on points that fall within the margin (or on the wrong side of it) and minimizes this penalty to determine the best hyperplane.
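For concreteness, the soft-margin version can be written as the following standard optimization problem, where C is the penalty factor on the slack variables measuring how far each point lies inside (or beyond) the margin:

$$\min_{w,\,b,\,\xi}\;\;\frac{1}{2}\lVert w\rVert^{2} \;+\; C\sum_{i=1}^{m}\xi_{i}
\qquad\text{subject to}\qquad
y_{i}\,(w\cdot x_{i} + b)\;\ge\;1-\xi_{i},\quad \xi_{i}\ge 0.$$

Larger C punishes margin violations more heavily (approaching the hard-margin case), while smaller C tolerates more violations in exchange for a wider margin.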
The shape of the decision boundary is determined by the "kernel" function used. The kernel can be linear, polynomial, or a Gaussian radial basis function (RBF). A non-linear kernel implicitly maps the data into a higher-dimensional space, so the separating hyperplane there corresponds to a curved boundary in the original space, which often allows for better classification depending on the type of data and the kernel's parameters.
For multi-class classification, we use the standard two-class SVM but take a one-vs-all or a one-vs-one approach to determine which single class a given test point fits best. In one-vs-all, n classifiers are trained, each comparing class x against all other classes; the test point is run through each classifier and the class with the highest score of belonging to the class rather than to the rest wins. In one-vs-one, a classifier is trained for every pair of classes and the class that wins the most pairwise votes is chosen.
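As a minimal sketch (using scikit-learn, which the rest of this notebook relies on), both strategies can be made explicit with the wrappers below; note that scikit-learn's SVC itself uses a one-vs-one scheme internally when given more than two classes.

In [ ]:
# Sketch only: explicit one-vs-rest and one-vs-one multi-class wrappers around an SVC.
# X_train, y_train and X_test are assumed to be defined as in the cells further down.
from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

ovr = OneVsRestClassifier(svm.SVC(C=1)).fit(X_train, y_train)  # n binary classifiers
ovo = OneVsOneClassifier(svm.SVC(C=1)).fit(X_train, y_train)   # n*(n-1)/2 pairwise classifiers
print(ovr.predict(X_test[:5]))
print(ovo.predict(X_test[:5]))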
SVMs have proven effective for text categorization and bioinformatics problems and are known to perform well on large datasets. They work best on (nearly) linearly separable data, though it is not certain whether the datasets examined in this project have this property.
In [1]:
%matplotlib inline
import csv
import numpy as np
from scipy import interp
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.cross_validation import StratifiedKFold
from sklearn import preprocessing
import seaborn as sns
In [2]:
def score(clf, X_test, y_test):
    idx = 0
    score = 0
    for x in X_test:
        y_pred = clf.predict([x])
        if y_pred == y_test[idx]:
            score += 1
        idx += 1
    return 100*score/idx
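For reference, this hand-rolled accuracy is the same (up to the factor of 100) as the mean accuracy that scikit-learn classifiers expose directly, so either can serve as a sanity check:

In [ ]:
# Equivalent accuracy via the built-in score method of any scikit-learn classifier.
# clf, X_test and y_test are assumed to be defined as in the cells below.
print(100 * clf.score(X_test, y_test))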
In [3]:
def score_complex(clf, X_test, Y_test):
    count = 0
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    user = Y_test.count(1)
    impostor = Y_test.count(0)
    for x in X_test:
        y_hat = clf.predict([x])
        if Y_test[count] == 1 and y_hat == 1:
            tp += 1
        if Y_test[count] == 1 and y_hat == 0:
            fn += 1
        if Y_test[count] == 0 and y_hat == 1:
            fp += 1
        if Y_test[count] == 0 and y_hat == 0:
            tn += 1
        count += 1
    return (round(tp/user*100, 2), round(tn/impostor*100, 2), round(fp/impostor*100, 2), round(fn/user*100, 2))
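As a cross-check, the same four rates can be read off scikit-learn's confusion matrix instead of counting by hand; a minimal sketch, assuming (as above) that 1 marks the genuine user and 0 an impostor:

In [ ]:
# Sketch: TP/TN/FP/FN rates derived from sklearn's confusion matrix.
from sklearn.metrics import confusion_matrix

def score_complex_cm(clf, X_test, Y_test):
    y_hat = clf.predict(X_test)
    # rows are true labels, columns are predictions, in the order [0, 1]
    tn, fp, fn, tp = confusion_matrix(Y_test, y_hat, labels=[0, 1]).ravel()
    user = tp + fn       # genuine samples in the test set
    impostor = tn + fp   # impostor samples in the test set
    return (round(tp/user*100, 2), round(tn/impostor*100, 2),
            round(fp/impostor*100, 2), round(fn/user*100, 2))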
In [38]:
filename = "../data/keystroke0.csv"
In [39]:
X = []
Y = []
with open(filename, 'r') as f:
    reader = csv.reader(f)
    first = 1
    for row in reader:
        if first:
            # skip the header row
            first = 0
            continue
        # timing features start at column 3; subject ids look like 's002'
        X.append([float(v) for v in row[3:]])
        Y.append(int(row[0][1:]))
In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = svm.SVC(C = 1)
clf.fit(X_train, y_train)
Out[31]:
In [7]:
print(str(score(clf, X_test, y_test)) + '%')
Now we normalize the data to mean 0 and variance 1, which gives a significantly better fit. We will normalize all datasets in this way from here on.
In [40]:
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = svm.SVC(C = 1)
clf.fit(X_train, y_train)
print(str(score(clf, X_test, y_test)) + '%')
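One refinement worth noting: scaling the full matrix before splitting lets statistics from the test rows leak into training. A minimal sketch of the stricter version, assuming X_train and X_test here are the raw (unscaled) splits, with the scaler fit on the training split only:

In [ ]:
# Sketch: fit the scaler on the training split only, then apply it to both splits.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
clf_s = svm.SVC(C=1).fit(X_train_s, y_train)
print(str(score(clf_s, X_test_s, y_test)) + '%')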
In [9]:
# Test different C values
results = []
Cs = [1, 25, 50, 75, 100]
for i in Cs:
    results.append(score(svm.SVC(C=i).fit(X_train, y_train), X_test, y_test))
print(Cs)
print(results)
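The same comparison can be run more systematically as a cross-validated grid search over C and the kernel; a minimal sketch (GridSearchCV lives in sklearn.grid_search in the older releases this notebook's imports suggest, and in sklearn.model_selection in newer ones):

In [ ]:
# Sketch: cross-validated search over C and kernel instead of a single train/test split.
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

param_grid = {'C': [1, 25, 50, 75, 100], 'kernel': ['linear', 'rbf', 'poly', 'sigmoid']}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)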
In [10]:
# Test different kernels
for i in ['linear', 'rbf', 'poly', 'sigmoid']:
    print(i + ': ' + str(score(svm.SVC(kernel=i).fit(X_train, y_train), X_test, y_test)) + '%')
RBF is clearly the best kernel in this case (which is also the default), so we will use it for all remaining calculations on this data set.
Clearly, SVM is fairly accurate at identifying the 57 users in the database based on their keyboard biometrics. However, much more important than this metric is the algorithm's ability to distinguish one specified user from the other 56, as would be the case if the user wanted to use this biometric as a means of authenticating his/her identity before logging into a computer, phone, etc.
In [11]:
genuines = [2, 5, 13, 39]
ys = []
for g in genuines:
    # Genuine = 1, Impostor = 0
    newY = []
    for elt in Y:
        if elt == g:
            newY.append(1)
        else:
            newY.append(0)
    ys.append(newY)
In [12]:
# Make sure we have exactly 400 1s in each list
for i in range(len(genuines)):
    print(sum(ys[i]))
In [13]:
scores = []
for y in ys:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    clf = svm.SVC(C=1)
    clf.fit(X_train, y_train)
    scores.append(str(score(clf, X_test, y_test)) + '%')
print(scores)
At first glance this looks odd: the four genuine users get almost identical scores. On reflection it is less surprising. They are not exactly the same, just very close, and the classes are heavily imbalanced (only 400 genuine samples per user against everyone else's), so a classifier that predicts "impostor" nearly everywhere is already correct the vast majority of the time. The headline accuracy therefore says little about how well the genuine user is actually recognized, which motivates the per-class rates below.
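To make the imbalance concrete, here is a hedged sketch of the majority-class baseline (always predicting "impostor") alongside an SVC that rescales the penalty C by class frequency via class_weight:

In [ ]:
# Sketch: majority-class baseline and a class-weighted SVC for the imbalanced setting.
y = ys[0]
baseline = 100 * y.count(0) / len(y)  # accuracy of always predicting 'impostor'
print('always-impostor baseline: ' + str(round(baseline, 2)) + '%')

Xb_train, Xb_test, yb_train, yb_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 'balanced' was called 'auto' in very old scikit-learn releases
clf_w = svm.SVC(C=1, class_weight='balanced').fit(Xb_train, yb_train)
print(str(score(clf_w, Xb_test, yb_test)) + '%')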
In [41]:
true_positives = []
false_negatives = []
true_negatives = []
false_positives = []
count = 0
for i in [x for x in range(2, 58) if x in Y]:
    count += 1
    user = [1 if x == i else 0 for x in Y]
    X_train, X_test, Y_train, Y_test = train_test_split(X, user, test_size=0.2, random_state=42)
    clf = svm.SVC()
    clf.fit(X_train, Y_train)
    scores = score_complex(clf, X_test, Y_test)
    true_positives.append(scores[0])
    false_negatives.append(scores[3])
    true_negatives.append(scores[1])
    false_positives.append(scores[2])
    # print('SUBJECT' + str(i) + ': \n True-Positive = ' + str(scores[0]) + '% \n False-Negative = ' + str(scores[3]) + '% \n True-Negative = ' + str(scores[1]) + '% \n False-Positive = ' + str(scores[2]) + '%')
print(count)
In [54]:
plt.figure()
sns.boxplot(true_positives,color = 'turquoise')
plt.title('True-Positive Rates for All Users')
plt.xlabel('TPR')
plt.show()
plt.figure()
sns.boxplot(false_positives,color = 'r')
plt.title('False-Positive Rates for All Users')
plt.xlabel('FPR')
plt.show()
Using SVM to validate the identity of each of the users against the others generates some interesting results. The TPR (i.e. the percentage of the time the true user is attempting to log in and is identified as such) takes on quite a large range of values between different users, with a mean of ~78%. This means that roughly 1 in 5 times the true user will be rejected from his/her own account, which is less than ideal. The complementary metric, FPR (i.e. the percentage of the time an impostor is attempting to log in as someone else and is incorrectly identified as the true user), takes on a much narrower range of values and, thankfully, has a much lower mean of ~5%, meaning that only about 1 in 20 attempts by an impostor will get into someone else's account; the distribution is also left-skewed. This is still not a terrific result, but it is not bad considering the naivety of the approach.
Below is the basic code for creating some ROC curves. In this test we simply make 6 different cross-validation folds and compare them against each other. Ideally we would use the same machinery to compare different C values, kernels, and datasets, as sketched after the plot.
In [16]:
# Use the first genuine-user labelling; the choice of user makes little difference here
yyy = np.array(ys[0])
XXX = np.array(X)
# Stratified 6-fold cross-validation
cv = StratifiedKFold(yyy, n_folds=6)
classifier = svm.SVC(kernel='linear', probability=True)
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
all_tpr = []
for i, (train, test) in enumerate(cv):
    probas_ = classifier.fit(XXX[train], yyy[train]).predict_proba(XXX[test])
    # Compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(yyy[test], probas_[:, 1])
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=1, label='ROC fold %d (area = %0.2f)' % (i, roc_auc))
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')
mean_tpr /= len(cv)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, 'k--',
label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()
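As noted above, the more useful comparison is across hyperparameters rather than across folds. A minimal sketch along those lines, reusing the same XXX/yyy arrays and stratified folds to compare kernels by mean AUC:

In [ ]:
# Sketch: compare kernels by mean AUC over the same stratified 6-fold split.
for kern in ['linear', 'rbf', 'poly', 'sigmoid']:
    aucs = []
    for train, test in StratifiedKFold(yyy, n_folds=6):
        probas_ = svm.SVC(kernel=kern, probability=True).fit(XXX[train], yyy[train]).predict_proba(XXX[test])
        fpr, tpr, _ = roc_curve(yyy[test], probas_[:, 1])
        aucs.append(auc(fpr, tpr))
    print(kern + ': mean AUC = ' + str(round(np.mean(aucs), 3)))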
In [17]:
filename = "../data/keystroke1.csv"
In [18]:
X = []
y = []
with open(filename, 'r') as f:
    reader = csv.reader(f)
    first = 1
    for row in reader:
        if first:
            first = 0
            continue
        rw = list(map(lambda x: x/10000000., map(int, row[5].split(' ')[1:61])))
        if len(rw) < 60:
            continue
        X.append(rw)
        y.append(int(row[0]))
X = preprocessing.scale(X)
In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = svm.SVC(C = 1)
clf.fit(X_train, y_train)
Out[19]:
In [20]:
print(str(score(clf, X_test, y_test)) + '%')
This is the GREYC dataset where users are also classified by demographics. It contains 110 users, and each has a binary class assigned based on age (younger or older than 30), gender, and handedness.
The data is split into 2 files, 'keystroke3-1.csv' and 'keystroke3-2.csv', each containing the typing of a different password.
In [21]:
filename = "../data/keystroke3-1.csv"
In [22]:
X = []
Y = []
gender = []
age = []
handedness = []
with open(filename, 'r') as f:
    reader = csv.reader(f)
    first = 1
    for row in reader:
        if first:
            first = 0
            continue
        rw = list(map(lambda x: x/10000000., map(int, row[6].split(' ')[1:61])))
        if len(rw) < 5:
            print(rw)
        X.append(rw)
        Y.append(int(row[0]))
        gender.append(row[1])
        age.append(int(row[2]))
        handedness.append(row[3])
# normalize with mean 0 and variance 1
X = preprocessing.scale(X)
# 1 = female, 0 = male
gender = [0 if x == 'M' else 1 for x in gender]
# 1 = right, 0 = left
handedness = [0 if x == 'L' else 1 for x in handedness]
Data set demographics:
In [23]:
print('Right-Handed: ' + str(np.round(handedness.count(0)/len(handedness),2)*100) + '%')
print('Male: ' + str(np.round(gender.count(0)/len(gender),2)*100) + '%')
plt.figure()
plt.hist(age,bins = 15)
plt.title('Age of Subjects')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
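The demographic labels loaded above are not used further below, but since each is a binary class, the same two-class SVM machinery applies directly. A purely illustrative sketch, using gender as the target (no claims here about how well this actually works on this data):

In [ ]:
# Sketch: predict a binary demographic label (gender) from the same timing features.
Xg_train, Xg_test, g_train, g_test = train_test_split(X, gender, test_size=0.2, random_state=42)
clf_g = svm.SVC().fit(Xg_train, g_train)
print('gender: ' + str(score(clf_g, Xg_test, g_test)) + '%')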
In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = svm.SVC(C = 1)
clf.fit(X_train, y_train)
print(str(score(clf, X_test, y_test)) + '%')
In [25]:
# test different kernels
for i in ['linear', 'rbf', 'poly', 'sigmoid']:
    print(i + ': ' + str(score(svm.SVC(kernel=i).fit(X_train, y_train), X_test, y_test)) + '%')
In contrast to the first data set, it appears as though a linear kernel is more appropriate for this data set, so it will be used in all further calculations.
In [26]:
true_positives = []
false_negatives = []
true_negatives = []
false_positives = []
for i in [x for x in range(1, max(Y) + 1) if x in Y]:
    user = [1 if x == i else 0 for x in Y]
    X_train, X_test, Y_train, Y_test = train_test_split(X, user, test_size=0.2, random_state=42)
    if Y_test.count(1) == 0:
        # no genuine samples for this user landed in the test split; skip to avoid dividing by zero
        print(i)
        continue
    clf = svm.SVC(kernel='linear')
    clf.fit(X_train, Y_train)
    scores = score_complex(clf, X_test, Y_test)
    true_positives.append(scores[0])
    false_negatives.append(scores[3])
    true_negatives.append(scores[1])
    false_positives.append(scores[2])
    # print('SUBJECT' + str(i) + ': \n True-Positive = ' + str(scores[0]) + '% \n False-Negative = ' + str(scores[3]) + '% \n True-Negative = ' + str(scores[1]) + '% \n False-Positive = ' + str(scores[2]) + '%')
This is the GREYC dataset where users typed multiple passwords. It contains 118 users, each with a unique password. The dataset is complex in its organization and should be treated as a stretch goal: for each password it contains samples from the genuine user as well as impostor attempts at that same password. Everything here lives in keystroke3.tar.gz.
The dataset is here and the associated paper that uses this dataset can be found here.
It is not worth looking into this dataset until we have solid results on the other datasets.
In [ ]:
In [ ]:
import csv
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.cross_validation import train_test_split
In [ ]:
filename = "../data/DSL-StrongPasswordData.csv"
In [ ]:
specs = []
users = []
with open(filename, 'r') as f:
    reader = csv.reader(f)
    first = 1
    for row in reader:
        if first:
            first = 0
            continue
        specs.append(row[3:])
        users.append(int(row[0][1:]))
In [ ]:
users
In [ ]:
specs
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(specs, users, test_size=0.2, random_state=42)
In [ ]:
y_train
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(specs, users, test_size=0.2, random_state=42)
clf = svm.SVC()
clf.fit(X_train, y_train)
In [ ]:
idx = 0
score = 0
for x in X_test:
    y_pred = clf.predict([x])
    if y_pred == y_test[idx]:
        score += 1
    idx += 1
print(str(100*score/idx) + "%")
In [ ]:
print(score, idx)
In [ ]:
# Decision-surface visualization (adapted from the scikit-learn iris example; the
# original axis labels were left over from it). The feature space here has many
# dimensions, so to draw a boundary we refit an SVC on just the first two timing features.
np_specs = np.array(X_train).astype(float)[:, :2]
np_users = np.array(y_train)
print(np_specs)
print(np_users)
clf2d = svm.SVC().fit(np_specs, np_users)
h = 0.02  # mesh step size
x_min, x_max = np_specs[:, 0].min() - 0.1, np_specs[:, 0].max() + 0.1
y_min, y_max = np_specs[:, 1].min() - 0.1, np_specs[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf2d.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
# Plot also the training points
plt.scatter(np_specs[:, 0], np_specs[:, 1], c=np_users, cmap=plt.cm.Paired)
plt.xlabel('First timing feature')
plt.ylabel('Second timing feature')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()
In [ ]:
specs[:3]